Statistical Disclosure Control
Contact: Peter-Paul de Wolf
Statistics Netherlands
P.O. Box 24500
2490 HA The Hague
The Netherlands
Phone: +31 70 337 5060

Last update: 10 Oct 2011
FAQ: Frequently asked questions

On this page you will find some frequently asked questions. If you cannot find the answer to your question here, please send your question to the European SDC-expert team.
Microdata
m1. Synthetic data
m2. What is microaggregation?
m3. I only want to apply microaggregation to several continuous scaled variables. In this case, do I have to provide a meta-data (rda) file as well?
m4. Could I also use µ-ARGUS if I have population data (no sampling weights)?
m5. When working with large datasets with many variables, creating a metadata file manually is very time consuming. Is there a way to get metadata files created automatically by usual statistical programs like SAS?
m6. Is there a limit to the number of variables or cases µ-ARGUS can handle?
m7. Is it possible to run methods in µ-ARGUS on subsets of the data?

Tabular data
t1. Thresholds in sensitivity rules
t2. Licences for τ-ARGUS
t3. The frequency rule and the (n,k)-rule implemented in τ-ARGUS
t4. What helps to reduce processing time of secondary cell suppression in τ-ARGUS?
t5. Which sensitivity rule should I use?
t6. Which of the secondary cell suppression algorithms of τ-ARGUS should I use: Modular, Hypercube, Network, or Optimal?
t7. What is the "singleton problem"?

Other
o1. In which directory can the test data be found after installation of Argus?
o2. What are the recommended hard- and software components?

Synthetic data

The following points explain why a synthetic dataset itself should be released, rather than only the parameters of the models used to generate it:

1) Synthetic data are normally generated by using more information on the original data than is specified in the model whose preservation is guaranteed by the data protector releasing the synthetic data.

2) As a consequence of the above, synthetic data may offer utility beyond the models they exactly preserve.

3) It is impossible to anticipate all possible statistics an analyst might be interested in, so access to the micro dataset should be granted.

4) Not all users of a public use file will have a sound background in statistics. Some users might only be interested in some descriptive statistics and are happy if they know the right commands in their statistical package to get what they want. They will not be able to generate the results if only the parameters are provided.

5) The imputation models in most applications can be very complex, because different models are fitted for every variable and often for different subsets of the dataset. This might lead to hundreds of parameters for just one variable. Thus, it is much more convenient even for the skilled user of the data to have the synthesized dataset available.

6) The most important reason for not releasing the parameters is that the parameters themselves could be disclosive on some occasions. For that reason, only some general statements about the generation of the public use file should be released. For example, these general statements could state which variables were included in the imputation model, but not the exact parameters. The user can then judge whether her analysis would be covered by the imputation model, but she will not be able to use the parameters to disclose any confidential information.

Thresholds in sensitivity rules

Or, in other words: what are the factors to consider when fixing the threshold for a sensitivity rule such as the dominance rule or the p% rule? What is the (positive or negative) influence of each factor on the threshold?

There are many different arguments for choosing the value of p. There can be legal restrictions. Also, the sensitivity of the table can be an argument for choosing a smaller or larger value of p. The SDC handbook is a valuable source for further reading on the sensitivity rules. A small sketch of how these rules and their thresholds work in practice follows below.
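A minimal sketch, not part of the original FAQ, of how the p% rule and the (n,k) dominance rule (both described further under t3 and t5 below) translate into a check on a single cell. The contribution values and thresholds are made-up illustrations; in practice τ-ARGUS evaluates these rules for you.

```python
# Sketch of the p% rule and the (n,k) dominance rule for a single
# table cell.  Contributions and thresholds are made-up illustrations.

def p_percent_unsafe(contributions, p):
    # The second-largest contributor can estimate the largest one as
    # (cell total - own contribution); the estimation error equals the
    # sum of all remaining contributions.  The cell is unsafe when that
    # error is smaller than p% of the largest contribution.
    xs = sorted(contributions, reverse=True)
    remainder = sum(xs) - xs[0] - xs[1]
    return remainder < (p / 100.0) * xs[0]

def nk_unsafe(contributions, n, k):
    # Dominance rule: unsafe when the n largest contributions together
    # exceed k% of the cell total.
    xs = sorted(contributions, reverse=True)
    return sum(xs[:n]) > (k / 100.0) * sum(xs)

cell = [900.0, 80.0, 15.0, 5.0]        # one dominant contributor
print(p_percent_unsafe(cell, p=10))    # True: error 20 < 90
print(nk_unsafe(cell, n=1, k=85))      # True: 900 > 850
```

Raising p, or lowering k, makes the rule stricter: more cells are flagged as unsafe and more suppression is needed. This is exactly the trade-off behind the threshold question above.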
Microaggregation

To obtain microaggregates in a microdata set with n records, these are combined to form g groups of size at least k. For each attribute, the average value over each group is computed and used to replace each of the original averaged values. Groups are formed using a criterion of maximal similarity. Once the procedure has been completed, the resulting (modified) records can be published.

The optimal k-partition (from the information loss point of view) is defined to be the one that maximizes within-group homogeneity; the higher the within-group homogeneity, the lower the information loss, since microaggregation replaces values in a group by the group centroid. The sum of squares criterion is commonly used to measure homogeneity in clustering. The within-groups sum of squares SSE is defined as

SSE = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)' (x_{ij} - \bar{x}_i),

where g is the number of groups, n_i is the number of records in group i, x_{ij} is the j-th record in group i, and \bar{x}_i is the average record of group i. The lower SSE, the higher the within-group homogeneity. Thus, in terms of sums of squares, the optimal k-partition is the one that minimizes SSE.

For a microdata set consisting of p attributes, these can be microaggregated together or partitioned into several groups of attributes. Also the way to form groups may vary. Several taxonomies are possible to classify the microaggregation algorithms in the literature: i) fixed group size vs variable group size; ii) exact optimal (only known for the univariate case) vs heuristic microaggregation; iii) continuous vs categorical microaggregation. A sketch of the basic fixed-group-size univariate variant is given after the references below.

Microaggregation has recently been proposed as an option to generate hybrid data, combining original data and synthetic data. The idea is to form small aggregates of k records and then, rather than replacing records in an aggregate by an average, replace them by synthetic records preserving the means and covariances of the original records in the aggregate.

For an illustrative example of the application of microaggregation to real data, see also the case study section of the ESSNet on SDC web page at: http://neon.vb.cbs.nl/casc/ESSNet/Case studies B2.pdf

References:
- The ESSNet Handbook on SDC offers further reading on this subject.
- Josep Domingo-Ferrer, 'Microaggregation', entry for the Encyclopedia of Database Systems, New York: Springer-Verlag, 2009, pp. 1736-1737. ISBN 978-0-387-35544-3.
- Josep Domingo-Ferrer, 'Microaggregation-based numerical hybrid data', in Joint UNECE/Eurostat Work Session on Statistical Disclosure Control, Bilbao, Basque Country, Dec. 2-4, 2009.
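A minimal sketch, not part of the original FAQ, of the simplest variant: fixed group size, a single continuous attribute, and grouping by sorting (a common heuristic; the exact optimal univariate algorithm is more involved). The data and the choice k = 3 are made up for illustration.

```python
# Sketch of fixed-group-size univariate microaggregation.
# Records are sorted on the attribute and split into consecutive groups
# of k (the remainder is merged into the last group, so every group has
# size >= k); each value is then replaced by its group average (centroid).

def microaggregate(values, k):
    # Assumes len(values) >= k.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    g = len(values) // k                  # number of groups
    for group in range(g):
        start = group * k
        end = start + k if group < g - 1 else len(values)
        members = order[start:end]
        centroid = sum(values[i] for i in members) / len(members)
        for i in members:
            result[i] = centroid
    return result

def sse(values, aggregated):
    # Within-groups sum of squares SSE: lower SSE means less information loss.
    return sum((v - a) ** 2 for v, a in zip(values, aggregated))

data = [12.0, 15.0, 14.0, 80.0, 85.0, 90.0, 13.0, 88.0]
agg = microaggregate(data, k=3)
print(agg)            # each value replaced by its group centroid
print(sse(data, agg))
```

Sorting maximizes within-group similarity in the univariate case; when several attributes are microaggregated together, multivariate heuristics such as MDAV are typically used instead.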
Licences for τ-ARGUS

For solving the complex mathematical models behind cell suppression and also controlled rounding, large optimisation problems need to be solved. For that we rely on high-quality solvers like Xpress and CPlex. Since version 4.1.0 we have also included the open solver SoPlex, which is free to use for academics and European NSIs. For the use with τ-ARGUS we have negotiated friendly prices for Xpress; contact support@fico.com about this. Note that for large(r) instances, the open solver might not be powerful enough.

Location of the test datasets

The software will usually be installed in a subdirectory (mu_argus or tauargus) of "C:\Program Files". But of course, during the installation you are free to choose a different location.

The frequency rule and the p% and (n,k)-rule implemented in τ-ARGUS

For the p% rule and the (n,k) rule, the interpretation is that a cell is unsafe if the value of the sensitivity rule is above the threshold. If the value is equal to the threshold, the protection levels would become zero. So no protection would be needed, and therefore the cell is considered safe. For more information on protection levels, see the SDC handbook.

Subsets of the data in µ-ARGUS

Meta data if only microaggregation is required

I only want to apply microaggregation to several continuous scaled variables. In this case, do I have to provide a meta-data (rda) file as well?

When using fixed-format microdata, you only have to specify in the RDA file the variables you will use for your job. The remaining, unspecified data will be copied to the output file as-is.

Population files

µ-ARGUS can also be used for population data (no sampling weights); only the risk model has been developed especially for sample files.

Large datasets with many variables

Is there a way to get metadata files created automatically by usual statistical programs like SAS?

2. When working with SPSS, a similar procedure applies: only the needed variables are exported from SPSS.

3. When working with SAS, there are two options. Either the user exports the necessary data to a comma-separated file with the variable names on the first row; in that case µ-ARGUS can read this first line and create a first version of the RDA file, which can then be extended. Alternatively, the SAS procedures developed during the ESSNet can be used. There are no plans to extend µ-ARGUS to read SAS files directly; we think that these export procedures do a better job.

µ-ARGUS limits

2. There is no real limit to the number of cases. Of course, larger data files require longer processing times, but many steps, like global recoding, are done after the initial computations. These steps only require aggregated information and will be done very quickly.

Hardware and software requirements

2 GB of RAM is usually enough.

Reduce processing time in τ-ARGUS

Which sensitivity rule should I use?

Cells with only a few contributors are risky, but cells with a few large (dominating) contributors can be risky as well. Traditionally, the dominance (n,k)-rule is used: the sum of the largest n contributors should not be more than k% of the cell total. More recently, the p%-rule is used more often. This rule states that no contributor to a cell should be able to estimate another contributor's value to within p%. This rule focuses more on the real problem and also behaves better in the case of waivers. Therefore, the p%-rule is recommended; see also the sketch under t1 above for how both rules are evaluated. For further reading, we refer to the SDC handbook.

Singleton problem

A singleton is a cell with only one contributor. That contributor knows the cell value exactly, since it is his own contribution. If, for example, two suppressed singleton cells protect each other in the same row or column, each of the two contributors can recompute the other's value, so the suppression pattern offers no real protection. Secondary cell suppression therefore has to treat singletons with special care; a small sketch of how singleton cells can be detected is given below.
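A minimal sketch, not part of the original FAQ, of flagging singleton cells, assuming microdata given as (cell, contribution) pairs; the cell labels and values are made-up illustrations. The actual protection of singletons is handled inside τ-ARGUS itself.

```python
# Sketch: flag "singleton" cells, i.e. cells with exactly one contributor.
# Input is a list of (cell, contribution) records; labels and values are
# made-up illustrations.
from collections import defaultdict

def find_singletons(records):
    cells = defaultdict(list)
    for cell, value in records:
        cells[cell].append(value)
    # A singleton's sole contributor knows the published cell value
    # exactly, so such cells need special care in secondary suppression.
    return {cell: vals[0] for cell, vals in cells.items() if len(vals) == 1}

records = [("A", 120.0), ("A", 35.0), ("B", 410.0), ("C", 15.0), ("C", 22.0)]
print(find_singletons(records))   # {'B': 410.0}
```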